144 research outputs found

    Hybrid Words Representation for the classification of low quality text

    Full text link
    University of Technology Sydney, Faculty of Engineering and Information Technology. Language enables humans to communicate with others. For instance, we talk and give our opinions and suggestions using natural language; to be more precise, we use words when communicating with others. In today's world, however, we also wish to communicate with computers in the same way. This is not an easy task because humans communicate in an unstructured and informal way, whereas computers need structured and clean data. It is therefore essential for computers to understand and classify text accurately for proper human-computer interaction. For classifying text, the first question we must address is how to improve low-quality text. The next challenge is to find the best representation so that the text can be classified accurately. The way text is organized reflects polysemy and the semantic and syntactical coupling relationships embedded in its contents. Effectively capturing such content relationships is therefore crucial for a better understanding of text representations. This is especially challenging in environments where the text messages are short, informal and noisy, and involve natural language ambiguities. Existing sentiment classification methods are designed mainly for documents and clean textual data and cannot capture the relationships, attributes and characteristics within tweet messages. Social media analysis, especially the analysis of tweet messages on Twitter, has become increasingly relevant since a significant portion of data is ubiquitous in nature. Social media-based short text is valuable for many reasons and is explored increasingly in text analysis, social media analysis and recommendation. At the same time, there are a number of challenges that need to be addressed in this space. One of the main issues is that traditional word embeddings are unable to capture polysemy (they assign the same representation to a word irrespective of its context and meaning) or out-of-vocabulary words (which are assigned a random representation). Furthermore, traditional word embeddings fail to capture the sentiment information of words, which results in similar vector representations for words with opposite polarities. Thus, ignoring polysemy within the context and the sentiment polarity of words in a tweet reduces the performance of tweet classification. To address the above-mentioned research challenges and the limitations of word-level representations, this thesis focuses on improving the representation of low-quality text: it improves the unstructured and informal nature of tweets to utilize the information thoroughly, and it manages natural language ambiguities to build a more robust sentiment classification model. Compared to previous studies, the proposed models can deal with the ubiquitous nature of short text, polysemy, and semantic and syntactical relationships within the content, thereby addressing natural language ambiguity problems. Chapter 4 presents the effects of pre-processing techniques using two different word representation models with machine and deep learning classifiers. We then present our recommended combination of pre-processing techniques, which improves low-quality text by performing sentiment-aware tokenization, spelling correction, word segmentation and other techniques to utilize most of the information hidden in unstructured text.
The experimental results show that the proposed combination performs well compared to other combinations. Chapter 5 presents the hybrid words representation. In this chapter, we propose our Deep Intelligent Contextual Embedding for Twitter sentiment analysis. The proposed model addresses natural language ambiguities and is devised to capture polysemy in context as well as the semantics, syntax and sentiment knowledge of words. A bi-directional Long Short-Term Memory (BiLSTM) network with attention is employed to determine the sentiment. We evaluate the proposed model by performing quantitative and qualitative analysis. The experimental results show that the proposed model outperforms various word embedding models in the sentiment analysis of tweets. The above-mentioned methods can be applied to any social media classification task. The performance of the proposed models is compared with that of different models, which supports the effectiveness of the proposed models and bounds the information loss in their generated high-quality representations.
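
    As an illustration of the kind of classifier described above, the following is a minimal PyTorch sketch of a bi-directional LSTM with a simple additive attention layer applied over pre-computed contextual token embeddings; the layer sizes, embedding dimension and number of classes are illustrative assumptions, not the thesis's exact architecture.

    import torch
    import torch.nn as nn

    class BiLSTMAttentionClassifier(nn.Module):
        """Minimal sketch: BiLSTM + additive attention over token embeddings.
        Hyper-parameters (embed_dim, hidden_dim, num_classes) are illustrative."""

        def __init__(self, embed_dim=768, hidden_dim=128, num_classes=3):
            super().__init__()
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                                bidirectional=True)
            self.attn = nn.Linear(2 * hidden_dim, 1)   # scores each time step
            self.out = nn.Linear(2 * hidden_dim, num_classes)

        def forward(self, embeddings):
            # embeddings: (batch, seq_len, embed_dim) pre-computed contextual vectors
            hidden, _ = self.lstm(embeddings)                   # (batch, seq_len, 2*hidden_dim)
            weights = torch.softmax(self.attn(hidden), dim=1)   # attention over time steps
            context = (weights * hidden).sum(dim=1)             # weighted sum of hidden states
            return self.out(context)                            # class logits

    # Example: a batch of 8 tweets, 32 tokens each, 768-dimensional embeddings
    logits = BiLSTMAttentionClassifier()(torch.randn(8, 32, 768))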

    A Robust Variable Step Size Fractional Least Mean Square (RVSS-FLMS) Algorithm

    Full text link
    In this paper, we propose an adaptive framework for the variable step size of the fractional least mean square (FLMS) algorithm. The proposed algorithm, named the robust variable step size FLMS (RVSS-FLMS), dynamically updates the step size of the FLMS to achieve a high convergence rate with low steady-state error. For evaluation purposes, the problem of system identification is considered. The experiments clearly show that the proposed approach achieves a better convergence rate than the FLMS and the adaptive step-size modified FLMS (AMFLMS). Comment: 15 pages, 3 figures, 13th IEEE Colloquium on Signal Processing & its Applications (CSPA 2017)
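
    For intuition, the following is a minimal NumPy sketch of a variable step-size LMS filter applied to system identification; it illustrates the general idea of adapting the step size from the instantaneous error rather than the paper's exact RVSS-FLMS recursion, and all constants are illustrative assumptions.

    import numpy as np

    def vss_lms_identify(x, d, taps=8, mu=0.01, alpha=0.97, gamma=1e-3,
                         mu_min=1e-4, mu_max=0.1):
        """Identify an FIR system from input x and desired output d with an LMS
        filter whose step size mu is adapted from the squared error.
        The adaptation constants are illustrative, not the RVSS-FLMS values."""
        w = np.zeros(taps)
        for n in range(taps, len(x)):
            u = x[n - taps + 1:n + 1][::-1]   # most recent input samples, newest first
            e = d[n] - w @ u                  # instantaneous estimation error
            mu = np.clip(alpha * mu + gamma * e**2, mu_min, mu_max)  # adapt step size
            w = w + mu * e * u                # LMS weight update
        return w

    # Example: recover a 4-tap unknown system from noisy observations
    rng = np.random.default_rng(0)
    h = np.array([0.5, -0.3, 0.2, 0.1])
    x = rng.standard_normal(5000)
    d = np.convolve(x, h)[:len(x)] + 0.01 * rng.standard_normal(len(x))
    w_hat = vss_lms_identify(x, d, taps=4)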

    EMPIRICAL TESTING OF HEURISTICS INTERRUPTING THE INVESTOR’S RATIONAL DECISION MAKING

    Get PDF
    The study aimed to investigate the impact of behavioral biases on investors' financial decision making. The current research studies behavioral biases including overconfidence, confirmation, illusion of control, loss aversion, mental accounting, status quo and excessive optimism. The study is significant for investors, policy makers, investment advisors, and bankers. Empirical data were collected by administering a questionnaire. Correlation and linear regression techniques are used to investigate whether investor decision making is affected by these biases. The study concluded that the confirmation, illusion of control, excessive optimism and overconfidence biases have a direct impact on investors' decision making, while the status quo, loss aversion and mental accounting biases have no impact, according to the data collected from financial institutions.

    Autologous hematopoietic stem cell transplantation-10 years of data from a developing country

    Get PDF
    Intensive chemotherapy followed by autologous stem cell transplantation is the treatment of choice for patients with hematological malignancies. The objective of the present study was to evaluate the outcomes of patients, mainly with lymphoma and multiple myeloma, after autologous stem cell transplant. The pretransplant workup consisted of a complete blood count, an evaluation of the liver, kidney, lung, and infectious profile, chest radiographs, and a dental review. For lymphoma, all patients who achieved at least a 25% reduction in the disease after salvage therapy were included in the study. Mobilization was done with cyclophosphamide, followed by granulocyte colony-stimulating factor, 300 μg twice daily. The conditioning regimens included BEAM (carmustine, etoposide, cytarabine, melphalan) and high-dose melphalan. A total of 206 transplants were performed from April 2004 to December 2014. Of these, 137 were allogeneic transplants and 69 were autologous. Of the patients receiving an autologous transplant, 49 were male and 20 were female. Of the 69 patients, 26 underwent transplantation for Hodgkin's lymphoma, 23 for non-Hodgkin's lymphoma, 15 for multiple myeloma, and 4 and 1 for Ewing's sarcoma and neuroblastoma, respectively. The median age ± SD was 34 ± 13.1 years (range, 4-64). A mean of 4.7 × 10⁸ ± 1.7 mononuclear cells per kilogram was infused. The median time to white blood cell recovery was 18.2 ± 5.34 days. Transplant-related mortality occurred in 10 patients. After a median follow-up period of 104 months, the overall survival rate was 86%. High-dose chemotherapy, followed by autologous stem cell transplant, is an effective treatment option for patients with hematological malignancies, allowing further consolidation of response.

    A Comprehensive Survey on Word Representation Models: From Classical to State-Of-The-Art Word Representation Language Models

    Full text link
    Word representation has always been an important research area in the history of natural language processing (NLP). Understanding such complex text data is imperative, given that it is rich in information and can be used widely across various applications. In this survey, we explore different word representation models and their power of expression, from classical models to modern state-of-the-art word representation language models (LMs). We describe the variety of text representation methods and model designs that have blossomed in the context of NLP, including state-of-the-art (SOTA) LMs. These models can transform large volumes of text into effective vector representations that capture the underlying semantic information. Such representations can in turn be used by various machine learning (ML) algorithms for a variety of NLP-related tasks. Finally, this survey briefly discusses commonly used ML and deep learning (DL) based classifiers, evaluation metrics and the applications of these word embeddings in different NLP tasks.
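
    As a concrete example of the classical end of this spectrum, the sketch below trains a small Word2Vec model with Gensim and looks up a word vector; the toy corpus and hyper-parameters are illustrative assumptions, not settings used in the survey.

    from gensim.models import Word2Vec

    # Toy corpus: a list of tokenized sentences (illustrative only)
    corpus = [
        ["word", "representation", "is", "central", "to", "nlp"],
        ["embeddings", "map", "words", "to", "dense", "vectors"],
        ["similar", "words", "get", "similar", "vectors"],
    ]

    # Train a small skip-gram Word2Vec model (Gensim 4.x API)
    model = Word2Vec(sentences=corpus, vector_size=50, window=3,
                     min_count=1, sg=1, epochs=50)

    vec = model.wv["vectors"]                     # 50-dimensional vector for "vectors"
    print(model.wv.most_similar("words", topn=3))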

    Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT

    Get PDF
    The availability of biomedical text data and advances in natural language processing (NLP) have made new applications in biomedical NLP possible. Language models trained or fine-tuned on domain-specific corpora can outperform general models, but work to date in biomedical NLP has been limited in terms of corpora and tasks. We present BioALBERT, a domain-specific adaptation of A Lite Bidirectional Encoder Representations from Transformers (ALBERT), trained on biomedical (PubMed and PubMed Central) and clinical (MIMIC-III) corpora and fine-tuned for 6 different tasks across 20 benchmark datasets. Experiments show that BioALBERT outperforms the state of the art on named entity recognition (+11.09% BLURB score improvement), relation extraction (+0.80% BLURB score), sentence similarity (+1.05% BLURB score), document classification (+0.62% F1-score), and question answering (+2.83% BLURB score). It represents a new state of the art on 17 out of 20 benchmark datasets. By making BioALBERT models and data available, we aim to help the biomedical NLP community avoid the computational costs of training and to establish a new set of baselines for future efforts across a broad range of biomedical NLP tasks.
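
    For context, the following is a minimal sketch of how a domain-specific ALBERT checkpoint can be loaded and prepared for a token-level task such as named entity recognition with the Hugging Face Transformers library; the checkpoint identifier and label count are placeholder assumptions, not the released BioALBERT configuration.

    from transformers import AutoTokenizer, AutoModelForTokenClassification

    # Placeholder checkpoint id -- substitute the actual released BioALBERT model
    CHECKPOINT = "path/or/hub-id-of-bioalbert"

    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    model = AutoModelForTokenClassification.from_pretrained(
        CHECKPOINT, num_labels=3)   # e.g. B/I/O tags for a simple NER scheme

    # Tokenize one biomedical sentence and obtain per-token label logits
    inputs = tokenizer("Metformin is used to treat type 2 diabetes.",
                       return_tensors="pt")
    logits = model(**inputs).logits   # shape: (1, seq_len, num_labels)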

    Text Mining of Stocktwits Data for Predicting Stock Prices

    Get PDF
    Stock price prediction can be made more efficient by considering price fluctuations and understanding people's sentiments. A limited number of models understand financial jargon or have labelled datasets concerning stock price change. To overcome this challenge, we introduce FinALBERT, an ALBERT-based model trained to handle financial-domain text classification tasks by labelling Stocktwits text data based on stock price change. We collected Stocktwits data spanning more than ten years for 25 different companies, including the five major FAANG companies (Facebook, Amazon, Apple, Netflix, Google). These datasets were labelled with three labelling techniques based on stock price changes. Our proposed model, FinALBERT, is fine-tuned with these labels to achieve optimal results. We experimented with the labelled dataset by training traditional machine learning, BERT, and FinBERT models, which helped us understand how these labels behave with different model architectures. The competitive advantage of our labelling method is that it can help analyse historical data effectively, and the mathematical function can easily be customised to predict stock movement.
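
    To illustrate the general idea of labelling text by stock price change, the following is a minimal pandas sketch that assigns a bearish/neutral/bullish label to each trading day from its close-to-close return; the threshold and label names are illustrative assumptions, not the paper's exact labelling functions.

    import pandas as pd

    def label_by_price_change(prices: pd.Series, threshold: float = 0.01) -> pd.Series:
        """Label each day from its close-to-close return.
        The threshold and label names are illustrative, not the paper's scheme."""
        returns = prices.pct_change()
        labels = pd.Series("neutral", index=prices.index)
        labels[returns > threshold] = "bullish"    # price rose by more than 1%
        labels[returns < -threshold] = "bearish"   # price fell by more than 1%
        return labels

    # Example: label a short synthetic series of closing prices
    close = pd.Series([100.0, 101.5, 101.4, 99.0, 99.2],
                      index=pd.date_range("2021-01-04", periods=5, freq="B"))
    print(label_by_price_change(close))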

    Frequency and outcome of graft versus host disease after stem cell transplantation: A Six-Year Experience from a Tertiary Care Center in Pakistan

    Get PDF
    Objective: The objective of this study was to evaluate the frequency and outcome of graft versus host disease (GvHD) after stem cell transplantation for various haematological disorders in Pakistan. Materials and Methods: Pre-transplant workup of the patient and donor was performed. Mobilization was done with G-CSF 300 μg twice daily for five days. Standard GvHD prophylaxis was done with methotrexate 15 mg/m² on day +1, followed by 10 mg/m² on days +3 and +6, and cyclosporine. Grading was done according to the Glucksberg classification. Results: A total of 153 transplants were done from April 2004 to December 2011. Out of these were allogeneic transplants. There were females and males. The overall frequency of any degree of graft versus host disease was 34%. Acute GvHD was present in patients while had chronic GvHD. Grade II GvHD was present in patients, while grade III and IV GvHD was seen in patients each. Acute myeloid leukemia and chronic myeloid leukemia were most commonly associated with GvHD. The mortality in acute and chronic GvHD was 8.8% and 12%, respectively. Conclusion: The frequency of graft versus host disease in this study was 34%, which is lower than reported in the international literature. The decreased incidence can be attributed to reduced diversity of histocompatibility antigens in our population.

    A Multi-Modal Dataset for Hate Speech Detection on Social Media: Case-study of Russia-Ukraine Conflict

    Get PDF
    Hate speech consists of types of content (e.g. text, audio, image) that express derogatory sentiments and hate against certain people or groups of individuals. The internet, particularly social media and microblogging sites, has become an increasingly popular platform for expressing ideas and opinions. Hate speech is prevalent in both offline and online media, and a substantial proportion of this kind of content is presented in different modalities (e.g. text, image, video). Taking into account that hate speech spreads quickly during political events, we present a novel multimodal dataset composed of 5680 text-image pairs of tweets related to the Russia-Ukraine war, annotated with a binary class: "hate" or "no-hate". The baseline results show that multimodal resources are relevant for leveraging hateful information from different types of data. The baselines and dataset provided in this paper may help researchers advance work on multimodal hate speech, particularly during serious conflicts such as wars.

    Benchmarking for Public Health Surveillance tasks on Social Media with a Domain-Specific Pretrained Language Model

    Get PDF
    User-generated text on social media enables health workers to keep track of information, identify possible outbreaks, forecast disease trends, monitor emergency cases, and ascertain disease awareness and response to official health correspondence. This exchange of health information on social media has been regarded as an attempt to enhance public health surveillance (PHS). Despite its potential, the technology is still in its early stages and is not ready for widespread application. Advancements in pretrained language models (PLMs) have facilitated the development of several domain-specific PLMs and a variety of downstream applications. However, there are no PLMs for social media tasks involving PHS. We present and release PHS-BERT, a transformer-based PLM, to identify tasks related to public health surveillance on social media. We compared and benchmarked the performance of PHS-BERT on 25 datasets from different social media platforms related to 7 different PHS tasks. Compared with existing PLMs that are mainly evaluated on limited tasks, PHS-BERT achieved state-of-the-art performance on all 25 tested datasets, showing that our PLM is robust and generalizable across common PHS tasks. By making PHS-BERT available, we aim to help the community reduce the computational cost and introduce new baselines for future work across various PHS-related tasks. Comment: Accepted @ ACL 2022 Workshop: The First Workshop on Efficient Benchmarking in NLP
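
    As a sketch of how such a pretrained model is typically applied to a downstream PHS task, the snippet below scores a short post with a sequence-classification head using the Hugging Face Transformers library; the checkpoint identifier and the two-label setup are placeholder assumptions, not the exact released configuration.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Placeholder hub id -- substitute the released PHS-BERT checkpoint
    CHECKPOINT = "path/or/hub-id-of-phs-bert"

    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    model = AutoModelForSequenceClassification.from_pretrained(
        CHECKPOINT, num_labels=2)   # e.g. health-related vs. not (assumed labels)

    post = "Feeling feverish and short of breath since yesterday."
    inputs = tokenizer(post, return_tensors="pt", truncation=True)

    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    print(probs)   # class probabilities under the assumed two-label setup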